1  Data Types and Structures

1.1 Data in R

In R, you can work with many different types of data including but not limited to data frames, lists, vectors, and matrices. For the purposes of our course, we are going to be working mostly with data frames. A data frame is a tabular data structure with observations in the rows and variables in the columns. Each of these variables may be stored within the data frame as a different type (class) of data. There are a few tricks in R to identify and change how a variable is stored.
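As a rough illustration of the structures named above, the sketch below builds one small example of each. The object names (`v`, `m`, `l`, `df`) are made up for this example, and the values are borrowed from a couple of rows of the hybrid data used later in this section.

```r
# A vector: values of a single type
v <- c(7.46, 8.20, 7.97)

# A matrix: values of a single type arranged in rows and columns
m <- matrix(1:6, nrow = 2)

# A list: elements can be of different types
l <- list(name = "Insight", year = 2000L, mpg = 53)

# A data frame: observations in rows, variables in columns
df <- data.frame(
  model = c("Insight", "Prius"),
  year  = c(2000L, 2004L)
)

# class() reports what kind of object each one is
class(v)
class(m)
class(l)
class(df)
```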

Follow each chunk to examine a data set.

hyb <- read.csv(file.choose()) # choose hybrid.csv
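If you would rather not pick the file in a dialog box each time, read.csv() also accepts a file path. The sketch below is only an illustration: it writes a three-row stand-in file to a temporary location so that it runs anywhere, and the names `tmp` and `hyb_small` are made up for the example. With the real course file, something like read.csv("hybrid.csv") should work, assuming the file sits in your working directory.

```r
# Write a tiny stand-in CSV to a temporary file so the example is self-contained.
# The rows are copied from the first three rows of the hybrid data.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,model,year,msrp",
  "1,Prius (1st gen.),1997,24509.74",
  "2,Tino Hybrid,2000,35354.97",
  "3,Prius (2nd gen.),2000,26832.25"
), tmp)

# Read the file by giving its path instead of choosing it interactively
hyb_small <- read.csv(tmp)
hyb_small
```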

First, you can take a look at the data set using a few different functions. Follow the logic in the next chunk to inspect the data.

hyb # call the object you created to see the whole data set.  
     id                        model year      msrp msrp_dollars accel_rate
1     1             Prius (1st gen.) 1997  24509.74  $24,509.74        7.46
2     2                  Tino Hybrid 2000  35354.97  $35,354.97        8.20
3     3             Prius (2nd gen.) 2000  26832.25  $26,832.25        7.97
4     4                      Insight 2000  18936.41  $18,936.41        9.52
5     5        Civic Hybrid 1st gen. 2001  25833.38  $25,833.38        7.04
6     6                      Insight 2001  19036.71  $19,036.71        9.52
7     7                      Insight 2002  19137.01  $19,137.01        9.71
8     8               Alphard Hybrid 2003  38084.77  $38,084.77        8.33
9     9                      Insight 2003  19137.01  $19,137.01        9.52
10   10                 Civic Hybrid 2003  14071.92  $14,071.92        8.62
11   11                Escape Hybrid 2004  36676.10  $36,676.10       10.32
12   12                      Insight 2004  19237.31  $19,237.31        9.35
13   13                        Prius 2004  20355.64  $20,355.64        9.90
14   14      Silverado 15 Hybrid 2WD 2004  30089.64  $30,089.64        9.09
15   15                 Lexus RX400h 2005  58521.14  $58,521.14       12.76
16   16        Civic Hybrid 2nd gen. 2005  26354.44  $26,354.44        7.63
17   17            Highlander Hybrid 2005  29186.21  $29,186.21       12.76
18   18                      Insight 2005  19387.76  $19,387.76        9.71
19   19                 Civic Hybrid 2005  18236.33  $18,236.33        8.26
20   20            Escape Hybrid 2WD 2005  19322.56  $19,322.56        9.52
21   21                Accord Hybrid 2005  16343.69  $16,343.69       14.93
22   22      Silverado 15 Hybrid 2WD 2005  32647.26  $32,647.26       11.11
23   23       Mercury Mariner Hybrid 2006  34772.40  $34,772.40        8.98
24   24                 Camry Hybrid 2006  29853.25  $29,853.25       11.28
25   25                 Lexus GS450h 2006  64547.56  $64,547.56       18.65
26   26                Estima Hybrid 2006  36012.70  $36,012.70        9.26
27   27                Altima Hybrid 2006  29524.75  $29,524.75       13.29
28   28       Chevrolet Tahoe Hybrid 2007  42924.35  $42,924.35       10.91
29   29                Kluger Hybrid 2007  46229.48  $46,229.48       12.76
30   30              Lexus LS600h/hL 2007 118543.60 $118,543.60       17.54
31   31               Tribute Hybrid 2007  24823.83  $24,823.83       11.28
32   32             GMC Yukon Hybrid 2007  57094.81  $57,094.81       12.28
33   33                  Aura Hybrid 2007  22110.87  $22,110.87       10.87
34   34                   Vue Hybrid 2007  22938.33  $22,938.33       10.75
35   35      Silverado 15 Hybrid 2WD 2007  34653.23  $34,653.23       11.49
36   36                 Crown Hybrid 2008  62290.38  $62,290.38        8.70
37   37     Cadillac Escalade Hybrid 2008  78932.81  $78,932.81        9.09
38   38                         F3DM 2008  23744.06  $23,744.06        9.52
39   39                Altima Hybrid 2008  18675.63  $18,675.63       13.70
40   40                       A5 BSG 2009  11849.43  $11,849.43        7.87
41   41                 Lexus RX450h 2009  46233.36  $46,233.36       13.47
42   42                ML450 Blue HV 2009  60519.83  $60,519.83       12.60
43   43             Prius (3rd gen.) 2009  24641.18  $24,641.18        9.60
44   44      S400 Hybrid/Hybrid Long 2009  96208.93  $96,208.93       13.89
45   45         Mercury Milan Hybrid 2009  30522.57  $30,522.57       11.55
46   46                 Lexus HS250h 2009  38478.15  $38,478.15       11.55
47   47           Avante/Elantra LPI 2009  21872.71  $21,872.71       10.21
48   48              ActiveHybrid X6 2009  97237.90  $97,237.90       17.96
49   49                          SAI 2009  39172.44  $39,172.44       11.55
50   50                Malibu Hybrid 2009  24768.79  $24,768.79        9.09
51   51                   Vue Hybrid 2009  26408.67  $26,408.67       13.70
52   52                    Aspen HEV 2009  44903.77  $44,903.77       13.51
53   53                      Durango 2009  41033.24  $41,033.24        8.33
54   54                    Auris HSD 2010  35787.29  $35,787.29        8.85
55   55                         CR-Z 2010  21435.54  $21,435.54        9.24
56   56                    F3DM PHEV 2010  23124.59  $23,124.59        9.24
57   57                   Touareg HV 2010  64198.95  $64,198.95       15.38
58   58                      Audi Q5 2010  37510.86  $37,510.86       14.08
59   59              Jeep Patriot EV 2010  17045.06  $17,045.06       12.05
60   60                  Besturn B50 2010  14586.61  $14,586.61        7.14
61   61        ActiveHybrid 7 Series 2010 104300.43 $104,300.43       20.41
62   62           Lincoln MKZ Hybrid 2010  37036.64  $37,036.64       11.15
63   63              Fit/Jazz Hybrid 2010  16911.85  $16,911.85        8.26
64   64                    Sonata HV 2010  28287.66  $28,287.66       14.70
65   65                 Cayenne S HV 2010  73183.47  $73,183.47       14.71
66   66                      Insight 2010  19859.16  $19,859.16        9.17
67   67    Fuga Hybrid/Infiniti M35h 2010  70157.02  $70,157.02       18.65
68   68               Chevrolet Volt 2010  42924.35  $42,924.35       10.78
69   69           Tribute Hybrid 4WD 2010  27968.32  $27,968.32       12.35
70   70            Fusion Hybrid FWD 2010  28033.51  $28,033.51       11.49
71   71                      HS 250h 2010  34753.53  $34,753.53       11.76
72   72           Mariner Hybrid FWD 2010  30194.95  $30,194.95       11.63
73   73                      RX 450h 2010  42812.54  $42,812.54       13.89
74   74          ML450 Hybrid 4natic 2010  55164.33  $55,164.33       12.99
75   75      Silverado 15 Hybrid 2WD 2010  38454.56  $38,454.56       11.76
76   76                  S400 Hybrid 2010  88212.78  $88,212.78       12.99
77   77                         Aqua 2011  22850.87  $22,850.87        9.35
78   78                 Lexus CT200h 2011  30082.16  $30,082.16        9.71
79   79         Civic Hybrid 3rd gen 2011  24999.59  $24,999.59        9.60
80   80              Prius alpha (V) 2011  30588.35  $30,588.35       10.00
81   81                 3008 Hybrid4 2011  45101.54  $45,101.54       11.36
82   82           Fit Shuttle Hybrid 2011  16394.36  $16,394.36        7.52
83   83          Buick Regal eAssist 2011  27948.93  $27,948.93       12.05
84   84                      Prius V 2011  27272.28  $27,272.28        9.51
85   85     Freed/Freed Spike Hybrid 2011  27972.07  $27,972.07        6.29
86   86                 Optima K5 HV 2011  26549.16  $26,549.16       10.54
87   87            Escape Hybrid FWD 2011  30661.34  $30,661.34       12.35
88   88                      Insight 2011  18254.38  $18,254.38        9.52
89   89               MKZ Hybrid FWD 2011  34748.52  $34,748.52       11.49
90   90                         CR-Z 2011  19402.80  $19,402.80       12.20
91   91                Sonata Hybrid 2011  25872.07  $25,872.07       11.90
92   92                 Camry Hybrid 2011  27130.82  $27,130.82       13.89
93   93           Tribute Hybrid 2WD 2011  26213.09  $26,213.09       12.50
94   94             Cayenne S Hybrid 2011  67902.28  $67,902.28       18.52
95   95               Touareg Hybrid 2011  50149.39  $50,149.39       16.13
96   96              ActiveHybrid 7i 2011 102605.66 $102,605.66       18.18
97   97                      Prius C 2012  19006.62  $19,006.62        9.35
98   98                    Prius PHV 2012  32095.61  $32,095.61        8.82
99   99                       Ampera 2012  31739.55  $31,739.55       11.11
100 100        ActiveHybrid 5 Series 2012  62180.23  $62,180.23       16.67
101 101                 Lexus GS450h 2012  59126.14  $59,126.14       16.95
102 102                      Insight 2012  18555.28  $18,555.28        9.42
103 103               Chevrolet Volt 2012  39261.96  $39,261.96       11.11
104 104              Camry Hybrid LE 2012  26067.66  $26,067.66       13.16
105 105               MKZ Hybrid FWD 2012  34858.84  $34,858.84       11.49
106 106                         M35h 2012  53860.45  $53,860.45       19.23
107 107             LaCrosse eAssist 2012  30049.52  $30,049.52       11.36
108 108        ActiveHybrid 5 Series 2012  61132.11  $61,132.11       17.54
109 109            Panamera S Hybrid 2012  95283.85  $95,283.85       17.54
110 110        Yukon 1500 Hybrid 2WD 2012  52626.77  $52,626.77       13.50
111 111                      Prius C 2013  19080.00  $19,080.00        8.70
112 112                 Jetta Hybrid 2013  24995.00  $24,995.00       12.66
113 113                 Civic Hybrid 2013  24360.00  $24,360.00       10.20
114 114                        Prius 2013  24200.00  $24,200.00       10.20
115 115            Fusion Hybrid FWD 2013  27200.00  $27,200.00       11.72
116 116             C-Max Hybrid FWD 2013  25200.00  $25,200.00       12.35
117 117                      Insight 2013  18600.00  $18,600.00       11.76
118 118              Camry Hybrid LE 2013  26140.00  $26,140.00       13.51
119 119            Camry Hybrid LXLE 2013  27670.00  $27,670.00       13.33
120 120                Sonata Hybrid 2013  25650.00  $25,650.00       11.76
121 121                Optima Hybrid 2013  25900.00  $25,900.00       11.63
122 122        Sonata Hybrid Limited 2013  30550.00  $30,550.00       11.76
123 123             Optima Hybrid EX 2013  31950.00  $31,950.00       11.36
124 124               Malibu eAssist 2013  24985.00  $24,985.00       11.49
125 125             LaCrosse eAssist 2013  31660.00  $31,660.00       11.36
126 126                Regal eAssist 2013  29015.00  $29,015.00       12.20
127 127                      RX 450h 2013  46310.00  $46,310.00       12.99
128 128        Highlander Hybrid 4WD 2013  40170.00  $40,170.00       13.89
129 129                    Q5 Hybrid 2013  50900.00  $50,900.00       14.71
130 130             Cayenne S Hybrid 2013  69850.00  $69,850.00       16.39
131 131               Touareg Hybrid 2013  62575.00  $62,575.00       16.13
132 132          Escalade Hybrid 2WD 2013  74425.00  $74,425.00       11.63
133 133             Tahoe Hybrid 2WD 2013  53620.00  $53,620.00       11.90
134 134        Yukon 1500 Hybrid 2WD 2013  54145.00  $54,145.00       11.88
135 135        Yukon 1500 Hybrid 4WD 2013  61960.00  $61,960.00       13.33
136 136                         CR-Z 2013  19975.00  $19,975.00       11.11
137 137               MKZ Hybrid FWD 2013  35925.00  $35,925.00       14.03
138 138                      CT 200h 2013  32050.00  $32,050.00       10.31
139 139                      ES 300h 2013  39250.00  $39,250.00       12.35
140 140                   ILX Hybrid 2013  28900.00  $28,900.00        9.26
141 141               ActiveHybrid 3 2013  49650.00  $49,650.00       14.93
142 142      Silverado 15 Hybrid 2WD 2013  41135.00  $41,135.00       12.35
143 143         Sierra 15 Hybrid 2WD 2013  41555.00  $41,555.00       10.00
144 144                      GS 450h 2013  59450.00  $59,450.00       16.67
145 145                         M35h 2013  54750.00  $54,750.00       19.61
146 146                  E400 Hybrid 2013  55800.00  $55,800.00       14.93
147 147        ActiveHybrid 5 Series 2013  61400.00  $61,400.00       12.99
148 148              ActiveHybrid 7L 2013  84300.00  $84,300.00       18.18
149 149            Panamera S Hybrid 2013  96150.00  $96,150.00       18.52
150 150                  S400 Hybrid 2013  92350.00  $92,350.00       13.89
151 151         Prius Plug-in Hybrid 2013  32000.00  $32,000.00        9.17
152 152  C-Max Energi Plug-in Hybrid 2013  32950.00  $32,950.00       11.76
153 153 Fusion Energi Plug-in Hybrid 2013  38700.00  $38,700.00       11.76
154 154               Chevrolet Volt 2013  39145.00  $39,145.00       11.11
      mpg mpg_mpge class
1   41.26    41.26     C
2   54.10    54.10     C
3   45.23    45.23     C
4   53.00    53.00    TS
5   47.04    47.04     C
6   53.00    53.00    TS
7   53.00    53.00    TS
8   40.46    40.46    MV
9   53.00    53.00    TS
10  41.00    41.00     C
11  31.99    31.99   SUV
12  52.00    52.00    TS
13  46.00    46.00     M
14  17.00    17.00    PT
15  28.23    28.23   SUV
16  39.99    39.99     C
17  29.40    29.40   SUV
18  52.00    52.00    TS
19  41.00    41.00     C
20  29.00    29.00   SUV
21  28.00    28.00     M
22  17.00    17.00    PT
23  32.93    32.93   SUV
24  33.64    33.64     M
25  33.40    33.40     M
26  47.04    47.04    MV
27  32.93    32.93     M
28  22.35    22.35   SUV
29  25.87    25.87   SUV
30  21.00    21.00     M
31  31.75    31.75   SUV
32  21.78    21.78   SUV
33  27.00    27.00     M
34  26.00    26.00   SUV
35  17.00    17.00    PT
36  37.16    37.16     M
37  22.35    22.35   SUV
38  30.11    85.00     M
39  34.00    34.00     M
40  35.28    35.28     M
41  31.99    31.99   SUV
42  23.99    23.99   SUV
43  47.98    47.98     C
44  26.34    26.34     L
45  40.69    40.69     M
46  54.10    54.10     C
47  41.87    41.87     C
48  18.82    18.82   SUV
49  54.10    54.10     M
50  29.00    29.00     M
51  28.00    28.00   SUV
52  21.00    21.00   SUV
53  21.00    21.00   SUV
54  68.21    68.21     C
55  37.00    37.00    TS
56  30.15    85.00     M
57  28.70    28.70   SUV
58  33.64    33.64   SUV
59  29.40    38.00   SUV
60  31.28    31.28     M
61  22.11    22.11     L
62  37.63    37.63     M
63  30.00    30.00     C
64  37.00    37.00     M
65  26.11    26.11   SUV
66  41.00    41.00     C
67  33.64    33.64     M
68  35.00    93.00     C
69  29.00    29.00   SUV
70  39.00    39.00     M
71  35.00    35.00     C
72  32.00    32.00   SUV
73  30.00    30.00   SUV
74  22.00    22.00   SUV
75  22.00    22.00    PT
76  21.00    21.00     L
77  50.00    50.00     C
78  42.00    42.00     C
79  44.36    44.36     C
80  72.92    72.92     M
81  61.16    61.16     C
82  58.80    58.80    MV
83  25.99    25.99     M
84  32.93    32.93     M
85  50.81    50.81    MV
86  36.00    36.00     M
87  32.00    32.00   SUV
88  41.00    41.00     C
89  39.00    39.00     M
90  37.00    37.00    TS
91  36.00    36.00     M
92  33.00    33.00     M
93  32.00    32.00   SUV
94  21.00    21.00   SUV
95  21.00    21.00   SUV
96  20.00    20.00     M
97  50.00    50.00     C
98  50.00    95.00     M
99  37.00    98.00     C
100 26.00    26.00     M
101 31.00    31.00     M
102 42.00    42.00     C
103 37.00    94.00     C
104 41.00    41.00     M
105 39.00    39.00     M
106 29.00    29.00     M
107 29.00    29.00     M
108 26.00    26.00     M
109 25.00    25.00     L
110 21.00    21.00   SUV
111 50.00    50.00     C
112 45.00    45.00     C
113 44.00    44.00     C
114 50.00    50.00     M
115 47.00    47.00     M
116 43.00    43.00     L
117 42.00    42.00     C
118 41.00    41.00     M
119 40.00    40.00     M
120 38.00    38.00     M
121 38.00    38.00     M
122 37.00    37.00     M
123 37.00    37.00     M
124 29.00    29.00     M
125 29.00    29.00     M
126 29.00    29.00     M
127 30.00    30.00   SUV
128 28.00    28.00   SUV
129 26.00    26.00   SUV
130 21.00    21.00   SUV
131 21.00    21.00   SUV
132 21.00    21.00   SUV
133 21.00    21.00   SUV
134 21.00    21.00   SUV
135 21.00    21.00   SUV
136 37.00    37.00    TS
137 45.00    45.00     M
138 42.00    42.00     C
139 40.00    40.00     M
140 38.00    38.00     C
141 28.00    28.00     C
142 21.00    21.00    PT
143 21.00    21.00    PT
144 31.00    31.00     M
145 29.00    29.00     M
146 26.00    26.00     M
147 26.00    26.00     M
148 25.00    25.00     L
149 25.00    25.00     L
150 21.00    21.00     L
151 50.00    95.00     M
152 43.00   100.00     M
153 43.00   100.00     M
154 37.00    98.00     C
head(hyb, 5) # Take a look at the first 5 observations in the set (you can set the number) 
  id                 model year     msrp msrp_dollars accel_rate   mpg mpg_mpge
1  1      Prius (1st gen.) 1997 24509.74  $24,509.74        7.46 41.26    41.26
2  2           Tino Hybrid 2000 35354.97  $35,354.97        8.20 54.10    54.10
3  3      Prius (2nd gen.) 2000 26832.25  $26,832.25        7.97 45.23    45.23
4  4               Insight 2000 18936.41  $18,936.41        9.52 53.00    53.00
5  5 Civic Hybrid 1st gen. 2001 25833.38  $25,833.38        7.04 47.04    47.04
  class
1     C
2     C
3     C
4    TS
5     C
tail(hyb, 10) # Look at the last 10 observations (you can set the number)
     id                        model year  msrp msrp_dollars accel_rate mpg
145 145                         M35h 2013 54750  $54,750.00       19.61  29
146 146                  E400 Hybrid 2013 55800  $55,800.00       14.93  26
147 147        ActiveHybrid 5 Series 2013 61400  $61,400.00       12.99  26
148 148              ActiveHybrid 7L 2013 84300  $84,300.00       18.18  25
149 149            Panamera S Hybrid 2013 96150  $96,150.00       18.52  25
150 150                  S400 Hybrid 2013 92350  $92,350.00       13.89  21
151 151         Prius Plug-in Hybrid 2013 32000  $32,000.00        9.17  50
152 152  C-Max Energi Plug-in Hybrid 2013 32950  $32,950.00       11.76  43
153 153 Fusion Energi Plug-in Hybrid 2013 38700  $38,700.00       11.76  43
154 154               Chevrolet Volt 2013 39145  $39,145.00       11.11  37
    mpg_mpge class
145       29     M
146       26     M
147       26     M
148       25     L
149       25     L
150       21     L
151       95     M
152      100     M
153      100     M
154       98     C
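Beyond head() and tail(), a few other base R functions give a quick overview without printing every row. The sketch below uses a small stand-in data frame, `hyb_demo` (a made-up name), built from the first three rows shown above; the same calls should work on hyb itself.

```r
# Small stand-in for hyb, built from the first three rows of the data
hyb_demo <- data.frame(
  id    = 1:3,
  model = c("Prius (1st gen.)", "Tino Hybrid", "Prius (2nd gen.)"),
  year  = c(1997L, 2000L, 2000L),
  msrp  = c(24509.74, 35354.97, 26832.25)
)

dim(hyb_demo)     # number of rows, then number of columns
nrow(hyb_demo)    # number of observations
ncol(hyb_demo)    # number of variables
names(hyb_demo)   # variable names
summary(hyb_demo) # a quick summary of each variable
```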

So, we have quite a few numeric and categorical variables here. We need to know how each of these variables is stored in this data set so we know how to work with them. The following chunk uses the str() function to take a look at how this data set is structured.

str(hyb)
'data.frame':   154 obs. of  9 variables:
 $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
 $ model       : chr  "Prius (1st gen.)" "Tino Hybrid" "Prius (2nd gen.)" "Insight" ...
 $ year        : int  1997 2000 2000 2000 2001 2001 2002 2003 2003 2003 ...
 $ msrp        : num  24510 35355 26832 18936 25833 ...
 $ msrp_dollars: chr  "$24,509.74 " "$35,354.97 " "$26,832.25 " "$18,936.41 " ...
 $ accel_rate  : num  7.46 8.2 7.97 9.52 7.04 9.52 9.71 8.33 9.52 8.62 ...
 $ mpg         : num  41.3 54.1 45.2 53 47 ...
 $ mpg_mpge    : num  41.3 54.1 45.2 53 47 ...
 $ class       : chr  "C" "C" "C" "TS" ...

Starting from the top left, we have a data.frame (a type of data structure) that has 154 observations (the rows) with 9 variables (the columns). Under this line is a description of each of the 9 variables. From left to right, we have the name of the variable, its type, then a short sample of the data stored in it. For example, we have an ‘id’ variable stored as an integer (int), which means it holds whole numbers. The ‘model’ variable is stored as a character (chr) variable, which indicates that it is stored as text (also known as a string). There is one more variable class you should know, which is factor. While characters are text, factors are categories with a set number of possible values.
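If you only want the storage type of each variable, one option is to apply class() to every column. The sketch below does this on a made-up stand-in data frame (`hyb_demo`), then contrasts a character vector with a factor built from the same text; `car_class` and `car_factor` are illustrative names only.

```r
# Small stand-in for hyb with one column of each common type
hyb_demo <- data.frame(
  id    = 1:3,
  model = c("Prius (1st gen.)", "Tino Hybrid", "Prius (2nd gen.)"),
  msrp  = c(24509.74, 35354.97, 26832.25)
)

# Apply class() to every column to see how each variable is stored
col_classes <- sapply(hyb_demo, class)
col_classes

# The same text stored as a character vector and as a factor
car_class  <- c("C", "TS", "C", "SUV")
car_factor <- factor(car_class)

is.character(car_class)
is.factor(car_factor)
levels(car_factor)   # the fixed set of categories the factor can take
```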

Notice the $? The $ is an operator used in R to access different elements in an object. This comes in handy when we want to work with the data and transform it. For example, we may wish to view certain elements of this data frame. Follow the logic in the following chunk.

class(hyb$id) # identify the class of data using class()
[1] "integer"
hyb$id_chr <- as.character(hyb$id) # convert id to character and store the result in a new variable

class(hyb$id_chr)
[1] "character"
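The $ operator can be used both to pull a variable out of a data frame and to add a new one. Here is a minimal sketch on a stand-in data frame; `hyb_demo` and `msrp_thousands` are names invented for the example.

```r
# Small stand-in for hyb
hyb_demo <- data.frame(
  id   = 1:3,
  msrp = c(24509.74, 35354.97, 26832.25)
)

hyb_demo$msrp        # extract one variable as a vector
hyb_demo[["msrp"]]   # the same variable, using its name as a string
hyb_demo$msrp[2]     # the second value of that variable

# Assigning to a name that does not exist yet creates a new variable
hyb_demo$msrp_thousands <- hyb_demo$msrp / 1000
names(hyb_demo)
```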

One final thing to note when working with character data is that often you need to convert characters into factors so that R recognises the long list of text as being truly categorical. This encodes each unique character string as a distinct category, recognising all values with the same text as sharing a category.

For example, the variable ‘class’ is actually categorical: it records what type of car each model is. It is, however, stored as a character. In order to do anything with this variable (say, visualising average cost by category), we need to convert it into a factor.

hyb$class <- as.factor(hyb$class)

class(hyb$class)
[1] "factor"
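Once a variable is a factor, you can list its categories, count the observations in each, and summarise another variable by category. The sketch below uses the first five rows of the data in a stand-in data frame (`hyb_demo`, a made-up name) and tapply() as one possible way to compute an average per category.

```r
# Stand-in built from the first five rows of the hybrid data
hyb_demo <- data.frame(
  msrp  = c(24509.74, 35354.97, 26832.25, 18936.41, 25833.38),
  class = c("C", "C", "C", "TS", "C")
)

hyb_demo$class <- as.factor(hyb_demo$class)

levels(hyb_demo$class)   # the distinct categories
table(hyb_demo$class)    # number of observations in each category

# Average msrp within each category of class
avg_msrp <- tapply(hyb_demo$msrp, hyb_demo$class, mean)
avg_msrp
```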

Keep this trick in your pocket! You are likely going to need this throughout the semester!

1.2 Levels of Data

Within a data set you will encounter different variables that are measured at various levels and using different units of measurement. Let’s say, for example, you have some survey data that asks questions about the respondent’s biological sex, income, and how satisfied they are at work. All of these questions are useful, and can be useful in visualisation. However, there are some visualisations that are more appropriate and useful for some of these more than others. In order to properly visualise data, we need to understand data.

There are two main ‘umbrella’ terms that you can use when talking about data. These are categorical and numeric. Categorical data, as the name suggests, are measured in buckets or categories while numeric data use units and numbers. There are a few further distinctions you need to understand before these become useful to you.

Categorical Data

Nominal: These are data with distinct labels that have no quantitative difference between one another.

E.g. Sex (Male, Female). Race (White, Black, Other).

Ordinal: These are categorical responses that are ranked in a specific order. The order is meaningful, but the distance between one response and the next is not necessarily equal.

E.g. Likert Scale (Agree, Neutral, Disagree).
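In R, ordinal data can be stored as an ordered factor, which records the ranking of the categories. The sketch below is an illustration with made-up responses; `responses` and `likert` are hypothetical names.

```r
# Made-up survey responses
responses <- c("Agree", "Disagree", "Neutral", "Agree")

# State the order of the categories explicitly, from lowest to highest
likert <- factor(
  responses,
  levels  = c("Disagree", "Neutral", "Agree"),
  ordered = TRUE
)

levels(likert)          # categories in their ranked order
likert[1] > likert[2]   # comparisons respect the stated order
table(likert)           # counts are reported in ranked order
```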

There are some other variations of categorical data that are sometimes referred to such as dummy variables (true or false, or 0/1). So, an honorable mention goes to dummy variables!!

Categorical data will almost always be stored as characters or factors. Alternatively, you might come across encoded versions of categorical data. For example, male and female may be given a numeric code, but we know this to be categorical. So, you must decide what to do. You may want to convert this to a categorical variable, or simply remember what 1 and 0 mean.
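One way to convert a numeric code into a labelled categorical variable is factor() with the levels and labels arguments. In the sketch below the coding (0 = Male, 1 = Female) and the names `sex_code` and `sex` are assumptions made for the example; check the codebook of your own data for the real coding.

```r
# Hypothetical coding for illustration: 0 = Male, 1 = Female
sex_code <- c(1, 0, 0, 1)

# Attach a label to each code so the variable is treated as categorical
sex <- factor(sex_code, levels = c(0, 1), labels = c("Male", "Female"))

sex
table(sex)
```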

Since there are distinct buckets of information that are stored in categorical data, it is best presented using tables, bar charts or pie charts.

Numeric Data

Interval: Continuous data that do not have a true zero point.

E.g. Temperature (measured in Fahrenheit), Time (measured on a 12-hour clock), ACT scores.

Ratio: Continuous data that have a true zero point.

E.g. Earnings (dollar amount), Age (measured in years).

Numeric data all have equal intervals (i.e. one decimal place, or one year, or one degree) which creates a continuous stream of data.

In R, numeric data is stored as integer (int) or numeric (num). You may come across data that should be numeric but is stored as categorical or perhaps a character.
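The msrp_dollars variable in the hybrid data is an example of this: it holds prices, but the dollar signs and commas force R to store it as text. A minimal sketch of one way to convert such values is shown below, using three values copied from the str() output earlier in this chapter; `msrp_clean` and `msrp_num` are made-up names.

```r
# Prices stored as text, copied from the msrp_dollars column
msrp_dollars <- c("$24,509.74 ", "$35,354.97 ", "$26,832.25 ")

# Remove the dollar signs and commas, then trim the surrounding spaces
msrp_clean <- trimws(gsub("[$,]", "", msrp_dollars))

# Convert the cleaned text to numbers
msrp_num <- as.numeric(msrp_clean)

class(msrp_num)
msrp_num
```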

1.2.1 Activity - Levels of Data

Look at the following examples of questions and, with a partner, decide whether the unit of measurement is nominal, ordinal, interval or ratio.

  1. Please indicate how much you earn a year from your current job:

    - $0 - $24,999
    - $25,000 - $49,999
    - $50,000 - $74,999
    - $75,000 - $99,999
    - $100,000+

  2. How much do you earn at your current job (in USD): _____________

  3. How likely are you to recommend this product?:

    1. Likely
    2. Neutral
    3. Unlikely